Method and apparatus with neural network processing

ABSTRACT

A neural network device includes a shift register circuit, a control circuit, and a processing circuit. The shift register circuit includes registers configured to, in each cycle of cycles, transfer stored data to a next register and store new data received from a previous register in a current register. The control circuit is configured to sequentially input data of input activations included in an input feature map into the shift register circuit in a preset order. The processing circuit, which includes crossbar array groups that receive input activations from at least one of the registers and perform a multiply-accumulate (MAC) operation with respect to the received input activations and weights, is configured to accumulate and add at least some operation results output from the crossbar array groups over a preset number of cycles to obtain an output activation of an output feature map.

CROSS-REFERENCE TO RELATED APPLICATION

This application claims the benefit under 35 U.S.C. § 119 of Korean Patent Application No. 10-2020-0089166, filed on Jul. 17, 2020, in the Korean Intellectual Property Office, the entire disclosure of which is incorporated herein by reference for all purposes.

BACKGROUND

1. Field

The present disclosure relates to a method and apparatus with neural network processing.

2. Description of Related Art

A neuromorphic processor may be, or be used in, a neural network device that drives various neural networks, such as a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), and a Feedforward Neural Network (FNN), and may also be used for data classification, image recognition, etc.

SUMMARY

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.

In one general aspect, a neural network device includes a shift register circuit, a control circuit, and a processing circuit. The shift register circuit includes registers configured to, in each cycle of cycles, transfer stored data to a next register and store new data received from a previous register in a current register. The control circuit is configured to sequentially input data of input activations included in an input feature map into the shift register circuit in a preset order. The processing circuit, which includes crossbar array groups that receive input activations from at least one of the registers and perform a multiply-accumulate (MAC) operation with respect to the received input activations and weights, is configured to accumulate and add at least some operation results output from the crossbar array groups over a preset number of cycles to obtain an output activation of an output feature map.

The control circuit may be further configured to receive a 1-bit zero mark on each of the cycles, and, in response to the value of the zero mark being 1, control the crossbar array groups to omit a MAC operation with respect to input activations corresponding to the zero mark.

Crossbar arrays included in one crossbar array group of the crossbar array groups may share a same input activation.

Each of the crossbar arrays may include row lines, column lines intersecting the row lines, and memory cells. The memory cells are disposed at the intersections of the row lines and the column lines, and configured to store the weights included in a weight kernel.

The processing circuit may be further configured to obtain a first output activation using an operation result output from one of the crossbar arrays, and obtain a second output activation using an operation result output from another of the crossbar arrays.

A number of the crossbar arrays included in the one crossbar array group may correspond to a width of a weight kernel.

A number of registers, from among the registers, that transfer an input activation to the crossbar array groups may correspond to a height of a weight kernel.

The processing circuit may be further configured to select at least some of the operation results output from the crossbar array groups, convert the selected operation results into a 2's complement format, and accumulate and add the converted operation results to obtain the output activation.

The processing circuit may include an output line through which the output activation is output, and the output line may correspond to an output of one of a plurality of layers constituting a neural network, and may be directly connected to an input line of a next layer.

The next layer may include either one or both of a convolution layer and a pooling layer.

In another general aspect, an operating method of a neural network device includes sequentially inputting input activations included in an input feature map into a shift register circuit in a preset order, receiving, by a corresponding crossbar array group of crossbar array groups, an input activation of the input activations from at least one of a plurality of registers of the shift register circuit and performing a multiply-accumulate (MAC) operation on the received input activation and weights, and obtaining an output activation included in an output feature map by accumulating and adding at least some of the calculation results output from the crossbar array groups in units of a preset number of cycles.

The operating method may further include receiving a 1-bit zero mark on each cycle of the sequentially inputting of the input activations, and, in response to the value of the zero mark being 1, controlling the crossbar array groups to omit the MAC operation with respect to input activations corresponding to the zero mark.

Crossbar arrays included in one crossbar array group of the crossbar array groups may share a same input activation.

Each of the crossbar arrays may include row lines, column lines intersecting the row lines, and memory cells disposed at the intersections of the row lines and the column lines, and configured to store the weights of a weight kernel.

The operating method may further include obtaining a first output activation using an operation result output from one of the crossbar arrays, and obtaining a second output activation using an operation result output from another crossbar array of the crossbar arrays.

A number of the crossbar arrays included in the one crossbar array group may correspond to a width of a weight kernel.

A number of registers, from among the plurality of registers, that transfer an input activation to the crossbar array groups may correspond to a height of a weight kernel.

The obtaining of the output activation may include selecting at least some operation results output from the crossbar array groups, converting the selected operation results into a 2's complement format, and accumulating and adding the converted operation results.

The operating method may further include outputting the output activation via an output line. The output line may correspond to an output of one of a plurality of layers constituting a neural network, and may be directly connected to an input line of a next layer.

The next layer may include either one or both of a convolutional layer and a pooling layer.

In another general aspect, a neural network device includes a shift register circuit and a processing circuit. The shift register circuit includes registers configured to sequentially transfer input activations of an input feature map from register to register. The processing circuit includes crossbar array groups configured to receive input activations from a subset of the registers and perform a multiply-accumulate (MAC) operation on the received input activations and weights, and is configured to output an activation of an output feature map by accumulating and adding calculation results output from the crossbar array groups over a predetermined number of cycles.

The registers may be further configured to receive a 1-bit zero mark on each cycle of the sequentially transferring of the input activations, and, in response to the value of the zero mark being 1, may control the crossbar array groups to omit the MAC operation with respect to input activations corresponding to the zero mark.

Crossbar arrays included in one crossbar array group of the crossbar array groups may share a same input activation.

Each of the crossbar arrays may include row lines, column lines intersecting the row lines, and memory cells, disposed at the intersections of the row lines and the column lines, configured to store the weights of a weight kernel.

A number of the crossbar arrays included in the one crossbar array group may correspond to a width of a weight kernel.

The outputting of the activation may include selecting at least some operation results output from the crossbar array groups, converting the selected operation results into a 2's complement format, and accumulating and adding the converted operation results.

Other features and aspects will be apparent from the following detailed description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram explaining an architecture of a neural network according to one or more embodiments.

FIG. 2 is a diagram explaining an operation performed in a neural network according to one or more embodiments.

FIG. 3 is a diagram illustrating an in-memory computing circuit according to one or more embodiments.

FIG. 4 is a diagram illustrating a configuration of a processing block included in a neural network device according to one or more embodiments.

FIG. 5 is a diagram illustrating a circuit structure of a neural network device according to one or more embodiments.

FIG. 6 is a diagram explaining a process of performing a neural network operation by a neural network device according to one or more embodiments.

FIG. 7 is a diagram explaining a process of performing pooling and activation function operations by a neural network device according to one or more embodiments.

FIG. 8 is a block diagram illustrating a configuration of an electronic system according to one or more embodiments.

FIG. 9 is a flowchart illustrating an operating method of a neural network device according to one or more embodiments.

Throughout the drawings and the detailed description, the same reference numerals refer to the same elements. The drawings may not be to scale, and the relative size, proportions, and depiction of elements in the drawings may be exaggerated for clarity, illustration, and convenience.

DETAILED DESCRIPTION

The following detailed description is provided to assist the reader in gaining a comprehensive understanding of the methods, apparatuses, and/or systems described herein. However, various changes, modifications, and equivalents of the methods, apparatuses, and/or systems described herein will be apparent after an understanding of the disclosure of this application. For example, the sequences of operations described herein are merely examples, and are not limited to those set forth herein, but may be changed as will be apparent after an understanding of the disclosure of this application, with the exception of operations necessarily occurring in a predetermined order. Also, descriptions of features that are known after an understanding of the disclosure of this application may be omitted for increased clarity and conciseness.

The features described herein may be embodied in different forms, and are not to be construed as being limited to the examples described herein. Rather, the examples described herein have been provided merely to illustrate some of the many possible ways of implementing the methods, apparatuses, and/or systems described herein that will be apparent after an understanding of the disclosure of this application.

Throughout the specification, when an element, such as a layer, region, or substrate, is described as being “on,” “connected to,” or “coupled to” another element, it may be directly “on,” “connected to,” or “coupled to” the other element, or there may be one or more other elements intervening therebetween. In contrast, when an element is described as being “directly on,” “directly connected to,” or “directly coupled to” another element, there can be no other elements intervening therebetween.

As used herein, the term “and/or” includes any one and any combination of any two or more of the associated listed items.

Although terms such as “first,” “second,” and “third” may be used herein to describe various members, components, regions, layers, or sections, these members, components, regions, layers, or sections are not to be limited by these terms. Rather, these terms are only used to distinguish one member, component, region, layer, or section from another member, component, region, layer, or section. Thus, a first member, component, region, layer, or section referred to in examples described herein may also be referred to as a second member, component, region, layer, or section without departing from the teachings of the examples.

Spatially relative terms such as “above,” “upper,” “below,” and “lower” may be used herein for ease of description to describe one element's relationship to another element as shown in the figures. Such spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures. For example, if the device in the figures is turned over, an element described as being “above” or “upper” relative to another element will then be “below” or “lower” relative to the other element. Thus, the term “above” encompasses both the above and below orientations depending on the spatial orientation of the device. The device may also be oriented in other ways (for example, rotated 90 degrees or at other orientations), and the spatially relative terms used herein are to be interpreted accordingly.

The terminology used herein is for describing various examples only, and is not to be used to limit the disclosure. The articles “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. The terms “comprises,” “includes,” and “has” specify the presence of stated features, numbers, operations, members, elements, and/or combinations thereof, but do not preclude the presence or addition of one or more other features, numbers, operations, members, elements, and/or combinations thereof.

The features of the examples described herein may be combined in various ways as will be apparent after an understanding of the disclosure of this application. Further, although the examples described herein have a variety of configurations, other configurations are possible as will be apparent after an understanding of the disclosure of this application.

Terminologies used herein are selected as commonly used by those of ordinary skill in the art in consideration of functions of the current embodiment, but may vary according to the technical intention, precedents, or a disclosure of a new technology. Also, in particular cases, some terms are arbitrarily selected by the applicant, and in this case, the meanings of the terms will be described in detail at corresponding parts of the specification. Accordingly, the terms used in the specification should be defined not simply by the names of the terms but based on the meaning and contents of the whole specification.

FIG. 1 is a diagram explaining an architecture of a neural network according to one or more embodiments.

In FIG. 1, the neural network 1 may be represented by a mathematical model by using nodes and edges. The neural network 1 may include an architecture of a deep neural network (DNN) or n-layer neural networks. The DNN or n-layer neural networks may correspond to convolutional neural networks (CNNs), recurrent neural networks (RNNs), deep belief networks, restricted Boltzmann machines, etc. For example, the neural network 1 may be implemented as a CNN, but is not limited thereto. The neural network 1 of FIG. 1 may correspond to some layers of the CNN. Accordingly, the neural network 1 may correspond to a convolutional layer, a pooling layer, or a fully connected layer, etc. of a CNN. However, for convenience, in the following descriptions, it is assumed that the neural network 1 corresponds to the convolutional layer of the CNN.

In such a convolution layer, a first feature map FM1 may correspond to an input feature map and a second feature map FM2 may correspond to an output feature map. A feature map may denote a data set representing various characteristics of input data. The first and second feature maps FM1 and FM2 may be high-dimensional matrices of two or more dimensions, and have respective activation parameters. When the first and second feature maps FM1 and FM2 correspond to, for example, three-dimensional feature maps, the first and second feature maps FM1 and FM2 have a width W (or column), a height H (or row), and a depth C. At this point, the depth C may correspond to the number of channels.

In a convolution layer, a convolution operation with respect to the first feature map FM1 and a weight map WM may be performed, and as a result, the second feature map FM2 may be generated. The weight map WM filters the first feature map FM1 and is referred to as a weight filter or weight kernel. In one example, a depth of the weight map WM, that is, the number of its channels, is the same as the depth of the first feature map FM1, that is, the number of its channels. The weight map WM is shifted by traversing the first feature map FM1 as a sliding window. In each shift, the weights included in the weight map WM may each be multiplied by the feature values in the region of the first feature map FM1 that the weight map overlaps, and the products may be added together. As the first feature map FM1 and the weight map WM are convolved, one channel of the second feature map FM2 may be generated.
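
The sliding-window computation described above can be summarized in a few lines. The following is a minimal single-channel sketch in Python, assuming unit stride and no padding; the function and variable names are illustrative only and do not come from the disclosure.

```python
import numpy as np

def conv2d_single_channel(fm1: np.ndarray, wm: np.ndarray) -> np.ndarray:
    """Slide the weight map WM over the input feature map FM1 and
    multiply-accumulate the overlapping values (unit stride, no padding)."""
    h, w = fm1.shape
    kh, kw = wm.shape
    fm2 = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(fm2.shape[0]):
        for j in range(fm2.shape[1]):
            # One shift of the sliding window: elementwise multiply, then sum.
            fm2[i, j] = np.sum(fm1[i:i + kh, j:j + kw] * wm)
    return fm2

fm1 = np.arange(16, dtype=float).reshape(4, 4)  # 4x4 input feature map
wm = np.ones((3, 3))                            # 3x3 weight kernel
print(conv2d_single_channel(fm1, wm))           # one 2x2 output channel
```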

In FIG. 1, although one weight map WM is depicted, a plurality of channels of the second feature map FM2 may be generated by respectively convolving a plurality of weight maps with the first feature map FM1. The second feature map FM2 of the convolution layer may then be used as an input feature map of the next layer. For example, the second feature map FM2 may be an input feature map of a pooling layer. However, the present embodiment is not limited thereto.

FIG. 2 is a diagram explaining an operation performed in a neural network 2 according to one or more embodiments.

In FIG. 2, the neural network 2 may have a structure that includes an input layer, hidden layers, and an output layer, may perform operations based on received input data (for example, I₁ and I₂), and may generate output data (for example, O₁ and O₂) based on a result of the operations.

As described above, the neural network 2 may be a DNN or an n-layer neural network including two or more hidden layers. For example, as illustrated in FIG. 2, the neural network 2 may be a DNN including an input layer (Layer 1), two hidden layers (Layer 2 and Layer 3), and an output layer (Layer 4). When the neural network 2 is implemented as a DNN architecture, the neural network 2 includes a larger number of layers capable of processing valid information, and thus, the neural network 2 may process more complex data sets than a neural network having a single layer. Although the neural network 2 is illustrated as including four layers, this is only an example, and the neural network 2 may include a lesser or greater number of layers, or a lesser or greater number of channels. That is, the neural network 2 may include layers of various structures different from those illustrated in FIG. 2.

Each of the layers included in the neural network 2 may include a plurality of channels. A channel may correspond to a plurality of artificial nodes, known as neurons, processing elements (PEs), units, or similar terms. For example, as illustrated in FIG. 2, the Layer 1 may include two channels (nodes), and each of the Layer 2 and Layer 3 may include three channels. However, this is only an example, and each of the layers included in the neural network 2 may include various numbers of channels (nodes).

The channels included in each of the layers of the neural network 2 may be connected to each other to process data. For example, one channel may receive data from other channels for an operation and output the operation result to other channels.

Each of the inputs and outputs of each of the channels may be referred to as an input activation and an output activation. That is, the activation may be an output of one channel and may be a parameter corresponding to an input of channels included in the next layer.

Each of the channels may determine its own activation based on activations received from channels included in the previous layer and appropriate weights. The weights are parameters used to calculate an output activation in each channel, and may be values assigned to connection relationships between channels.

Each of the channels may be processed by, for example, a hardware computational unit or processing element that outputs an output activation by receiving an input, and an input-output of each of the channels may be mapped. For example, when σ is an activation function, w_(jk)^(i) is a weight from a k^(th) channel included in an (i−1)^(th) layer to a j^(th) channel included in an i^(th) layer, b_(j)^(i) is a bias of the j^(th) channel included in the i^(th) layer, and a_(j)^(i) is an activation of the j^(th) channel in the i^(th) layer, the activation a_(j)^(i) may be calculated by using Equation 1 below.

$a_{j}^{i} = \sigma\left( \sum\limits_{k}\left( w_{jk}^{i} \times a_{k}^{i-1} \right) + b_{j}^{i} \right) \quad \text{(Equation 1)}$

As shown in FIG. 2, the activation of a first channel CH1 of the second layer Layer 2 may be expressed as a₁². Also, a₁² may have a value of a₁² = σ(w_(1,1)²×a₁¹ + w_(1,2)²×a₂¹ + b₁²) according to Equation 1. The activation function σ may be a Rectified Linear Unit (ReLU), but the present embodiment is not limited thereto. For example, the activation function σ may be sigmoid, hyperbolic tangent, Maxout, etc.
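
As a quick numeric check of Equation 1, the sketch below evaluates a₁² with a sigmoid activation; all of the weight, activation, and bias values are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Activations of Layer 1, weights into channel 1 of Layer 2, and its bias.
# All values here are made up for illustration.
a_prev = np.array([0.5, -1.0])   # a1^1, a2^1
w = np.array([0.8, 0.3])         # w_(1,1)^2, w_(1,2)^2
b = 0.1                          # b_1^2

# Equation 1: a_1^2 = sigma(w_(1,1)^2 * a1^1 + w_(1,2)^2 * a2^1 + b_1^2)
a = sigmoid(np.dot(w, a_prev) + b)
print(a)  # sigmoid(0.2) ~= 0.5498 with these values
```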

As described above, in the neural network 2, a large number of data sets are exchanged between a plurality of interconnected channels, and numerous computational processes are performed through the layers. In this operation, a large number of MAC (multiply-accumulate) operations are performed, and a large number of memory access operations typically must be performed to load the activations and weights, which are the operands of the MAC operations, at an appropriate time.

On the other hand, a typical digital computer uses a Von Neumann structure in which a computation unit and a memory are separated and includes a common data bus for data transmission between the two separated blocks. Accordingly, in the process of performing the neural network 2, in which data movement and operation are continuously repeated, typically, a lot of time is required to transmit data and excessive power may be consumed.

In one or more embodiments, an in-memory computing circuit may be desirable as an architecture that integrates memory and a computation unit performing MAC operations into one, for example.

FIG. 3 is a diagram illustrating an in-memory computing circuit 3according to one or more embodiments.

In FIG. 3, the in-memory computing circuit 3 may include an analog crossbar array 30 and an analog-to-digital converter (ADC) 40. However, only components related to the present embodiments are depicted in the in-memory computing circuit 3 illustrated in FIG. 3. Accordingly, it will be apparent after an understanding of the disclosure of this application that components other than, or in addition to, those shown in FIG. 3 may further be included in the in-memory computing circuit 3.

The analog crossbar array 30 may include a plurality of row lines 310, a plurality of column lines 320, and a plurality of memory cells 330. The plurality of row lines 310 may be used to receive input data. For example, when the plurality of row lines 310 is N (N is a natural number) row lines, voltages V₁, V₂, . . . , V_N corresponding to input activations may be applied to the N row lines. The plurality of column lines 320 may cross the plurality of row lines 310. For example, when the plurality of column lines 320 are M (M is a natural number) column lines, the plurality of column lines 320 and the plurality of row lines 310 may cross at N×M intersections.

In this example, a plurality of memory cells 330 may be disposed at the intersections of the plurality of row lines 310 and the plurality of column lines 320. Each of the plurality of memory cells 330 may be implemented as a nonvolatile memory, such as ReRAM (Resistive RAM), MRAM (Magnetic RAM), or eFlash, to store weights, but is not limited thereto. Each of the plurality of memory cells 330 may be a volatile memory, such as static random access memory (SRAM).

In the analog crossbar array 30 illustrated in FIG. 3, the plurality of memory cells 330 may have conductances G₁₁, . . . , G_NM corresponding to weights. When a voltage corresponding to an input activation is applied to each of the plurality of row lines 310, according to Ohm's law, a current of magnitude I = V×G may be output through each memory cell 330. Since the currents output from the memory cells arranged along a column line are summed together, the current sums I₁, . . . , I_M may be output along the plurality of column lines 320. The current sums I₁, . . . , I_M may correspond to the result of a MAC operation performed in an analog manner.
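
The column-wise current summation can be modeled digitally as a matrix-vector product. The sketch below is an idealized model of the analog MAC, assuming ideal cells and ignoring ADC quantization; the dimensions and values are hypothetical.

```python
import numpy as np

# Digital model of the analog MAC in the crossbar array of FIG. 3.
# Row voltages encode input activations; cell conductances encode weights.
N, M = 4, 3                   # 4 row lines, 3 column lines (hypothetical)
rng = np.random.default_rng(0)
V = rng.random(N)             # input voltages V_1..V_N
G = rng.random((N, M))        # conductances G_11..G_NM

# Ohm's law per cell (I = V * G) and current summation per column line:
# each column current is the sum of the currents of its cells.
I = V @ G                     # current sums I_1..I_M == analog MAC result
print(I)
```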

The ADC 40 may convert the result of the analog MAC operation output from the analog crossbar array 30 (that is, the current sums I₁, . . . , I_M) into a digital signal. The result of the MAC operation converted to a digital signal is output from the ADC 40 and may be used in a subsequent neural network operation process.

On the other hand, the in-memory computing circuit 3, as shown in FIG. 3, has the advantages of lower complexity of the core operation unit, less power consumption, and smaller circuit size than a digital computer. However, in the process of mapping the synaptic weights, to which thousands or tens of thousands of neurons of a neural network model are connected, onto the in-memory computing circuit 3, a limitation on physical size may occur. According to the present disclosure, a neural network device capable of operating a neural network at low power by using the in-memory computing circuit 3, with its various advantages, while satisfying the constraint on physical size may be provided. Hereinafter, an efficient structure and operating method of a neural network device according to the present embodiment will be described in detail with reference to the drawings.

FIG. 4 is a diagram illustrating a configuration of a processing blockincluded in a neural network device according to one or moreembodiments.

In FIG. 4, the neural network device may include a processing block 4. Although only one processing block 4 is shown in FIG. 4, the neural network device may include a plurality of processing blocks 4. Therefore, it will be apparent after an understanding of the disclosure of this application that components other than, or in addition to, those shown in FIG. 4 may further be included in the neural network device. For example, the neural network device may further include at least one control circuit 520.

The at least one control circuit 520 may perform the overall function of controlling the neural network device. For example, the at least one control circuit 520 may control the operation of the processing block 4. In this example, the at least one control circuit 520 may be implemented as an array of a plurality of logic gates, or as a combination of a general-purpose microprocessor and a memory in which a program executable by the microprocessor is stored.

The processing block 4 may perform a MAC operation after receiving data from an external memory or an internal memory of the neural network device, and may store a result of the MAC operation in a memory again. The processing block 4 may perform a pooling or activation function operation after completing a MAC operation with respect to one layer.

The processing block 4 may include a plurality of processing elements (Processing Element 0, . . . , Processing Element K), where K represents an arbitrary natural number. Each of the processing elements (Processing Element 0, . . . , Processing Element K) may include a plurality of sub-processing elements. For example, as shown in FIG. 4, Processing Element 0 may include three sub-processing elements (Sub PE 0, Sub PE 1, and Sub PE 2).

Each of the plurality of sub-processing elements may include a plurality of crossbar arrays. For example, in a non-limiting example, Sub PE 0 may include three crossbar arrays (Crossbar Array 0, Crossbar Array 1, and Crossbar Array 2), Sub PE 1 may include three crossbar arrays (Crossbar Array 3, Crossbar Array 4, and Crossbar Array 5), and Sub PE 2 may include three crossbar arrays (Crossbar Array 6, Crossbar Array 7, and Crossbar Array 8). In this way, a preset number of crossbar arrays may form one group, and one crossbar array group may correspond to one sub-processing element. In this example, since each of the plurality of crossbar arrays corresponds to the analog crossbar array 30 of FIG. 3, repeated descriptions thereof will be omitted.

The nine crossbar arrays included in one processing element may be used for a 3×3 convolution operation, for example. Hereinafter, a detailed process of performing a 3×3 convolution operation by a neural network device will be described in detail with reference to FIGS. 5 and 6.

For convenience of explanation of FIG. 4, an example in which one processing element includes three sub-processing elements and one sub-processing element includes three crossbar arrays has been described, as a non-limiting example. One processing element may include fewer or more than three sub-processing elements, and one sub-processing element may include fewer or more than three crossbar arrays. According to the configuration of the processing block 4 included in a neural network device, neural network operations (e.g., convolution operations) of various sizes may be performed.

FIG. 5 is a diagram illustrating a circuit structure of a neural networkdevice according to one or more embodiments.

In FIG. 5, the neural network device may include a shift register circuit 510, a control circuit 520, and a processing circuit 530. In the neural network device illustrated in FIG. 5, only components related to the present embodiments are shown. Therefore, it will be apparent after an understanding of the disclosure of this application that components other than, or in addition to, those shown in FIG. 5 may further be included in the neural network device.

The shift register circuit 510 may include a plurality of registers that transfer stored data to the next register on every cycle and store new data received from the previous register. At least some of the plurality of registers included in the shift register circuit 510 may be connected to a crossbar array group (sub-processing element) included in the processing circuit 530, and may transfer input data (for example, an input activation) to the plurality of crossbar arrays included in the connected crossbar array group.

From among the plurality of registers included in the shift register circuit 510, the number of registers that transfer an input activation to the plurality of crossbar array groups (Sub PE 0, Sub PE 1, and Sub PE 2) may correspond to a height of a weight kernel. In an example, when a 3×3 convolution, in which a height KH of a weight kernel is 3 and a width KW of the weight kernel is 3, is performed, the number of registers that transfer an input activation to the plurality of crossbar array groups (Sub PE 0, Sub PE 1, and Sub PE 2) included in one processing element PE 0 may be KH. Accordingly, when K input lines are processed by K processing elements, the total number of registers that transfer input activations to the processing circuit 530 may be KH*K. In this example, when a 3×3 convolution is performed with respect to input activations input through K input lines, the number of output lines through which output activations are output may correspond to K−2.
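
These sizing rules can be captured in a small helper. In the sketch below, generalizing the K−2 output-line count to K−(KH−1) for kernels other than 3×3 is our assumption, not something stated in the disclosure; the function name is illustrative.

```python
def register_and_line_counts(kh: int, k: int) -> dict:
    """Sizing rules from the description: KH tap registers per processing
    element, KH*K tap registers in total, and K - (KH - 1) output lines
    (K - 2 for a 3x3 kernel; the general form is an assumption)."""
    return {
        "taps_per_processing_element": kh,
        "total_tap_registers": kh * k,
        "output_lines": k - (kh - 1),
    }

print(register_and_line_counts(kh=3, k=4))
# {'taps_per_processing_element': 3, 'total_tap_registers': 12, 'output_lines': 2}
```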

The control circuit 520 may sequentially input the input activations included in an input feature map to the shift register circuit 510 according to a preset order. For example, the control circuit 520 may sequentially input the input activations to the shift register circuit 510 in a row direction of the input feature map. The input activations input to the shift register circuit 510 may be sequentially shifted from the first register to the last register of the shift register circuit 510.

Also, the control circuit 520 may receive a 1-bit zero mark on every cycle, and, when the value of the zero mark is 1, may control the plurality of crossbar array groups (Sub PE 0, Sub PE 1, and Sub PE 2) so as to omit a MAC operation with respect to the input activations corresponding to the zero mark. The zero mark may be input together with input feature map data of a row size, and may be used for at least one of zero padding and zero skip. In an example, because the MAC operation of the crossbar arrays (three crossbar arrays when a 3×3 convolution operation is performed) included in a crossbar array group corresponding to a zero mark having a value of 1, from among the plurality of crossbar array groups (Sub PE 0, Sub PE 1, and Sub PE 2), is omitted, power consumption may be further reduced.
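
A behavioral model of the zero-mark gating might look as follows. This is a sketch of the skip logic only, not the disclosed circuit: the crossbar MAC is modeled as a matrix product, and the stream, weights, and shapes are hypothetical.

```python
import numpy as np

def mac_with_zero_skip(stream, weights):
    """Gate a MAC with the 1-bit zero mark described above (illustrative
    model). `stream` yields one (zero_mark, activation_vector) pair per
    cycle; `weights` stands in for the cell weights of a crossbar array."""
    results = []
    for zero_mark, activation in stream:
        if zero_mark == 1:
            # Zero padding / zero skip: the operand is known to be zero,
            # so the crossbar operation is omitted and contributes nothing.
            results.append(np.zeros(weights.shape[1]))
        else:
            results.append(activation @ weights)  # the MAC that would run
    return results

weights = np.ones((4, 2))                         # hypothetical cell weights
stream = [(0, np.array([1.0, 2.0, 3.0, 4.0])),    # normal cycle
          (1, np.zeros(4))]                       # zero-marked cycle: skipped
print(mac_with_zero_skip(stream, weights))
```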

The processing circuit 530 may include a plurality of crossbar array groups (Sub PE 0, Sub PE 1, and Sub PE 2) that receive input activations from at least one of the plurality of registers and perform a MAC operation with respect to the received input activations and weights. The plurality of crossbar arrays (for example, Crossbar array 0, Crossbar array 1, and Crossbar array 2) included in one crossbar array group (for example, Sub PE 0) among the plurality of crossbar array groups (Sub PE 0, Sub PE 1, and Sub PE 2) may share the same input activation.

Since the number of crossbar arrays included in one crossbar array group corresponds to a width of a weight kernel, the input activation may be shared among KW crossbar arrays. In one example, when a 3×3 convolution operation is performed, three crossbar arrays may receive the same input activation and calculate an output for three weight spaces (that is, a weight row having a size of 1×3). The outputs for the respective weight spaces may be used to calculate different output activations from each other.

In this way, since the same input activation is shared among a plurality of crossbar arrays, the input reuse efficiency may be significantly increased and a multiplexer (MUX) for processing input data is not required when compared to typical hardware devices, and thus, the hardware structure may be significantly simplified. Also, since a digital decoder and the control logic required to operate a crossbar array are shared among the plurality of crossbar arrays, the area of the hardware may also be reduced.

The processing circuit 530 accumulates and adds at least some of the operation results output from the plurality of crossbar array groups (Sub PE 0, Sub PE 1, and Sub PE 2) in units of a preset number of cycles, and thus may obtain an output activation included in an output feature map. For example, the processing circuit 530 selects at least some of the operation results output from the plurality of crossbar array groups (Sub PE 0, Sub PE 1, and Sub PE 2), converts the selected operation results into a 2's complement format, and accumulates and adds the converted operation results, and thus may obtain an output activation.
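
The conversion step can be illustrated with plain integer arithmetic. The sketch below assumes, purely for illustration, 8-bit raw codes that are reinterpreted as signed two's-complement partial sums before accumulation; the bit width and code values are hypothetical.

```python
def to_twos_complement(value: int, bits: int = 8) -> int:
    """Interpret an unsigned code as a signed two's-complement number."""
    return value - (1 << bits) if value >= (1 << (bits - 1)) else value

# Hypothetical 8-bit raw codes from three selected crossbar arrays.
raw_codes = [0x12, 0xF4, 0x07]           # 18, -12, 7 once sign-extended
partial_sums = [to_twos_complement(c) for c in raw_codes]
print(sum(partial_sums))                 # accumulated output activation: 13
```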

The processing circuit 530 may calculate a first output activation (e.g., Output data 0) by using an operation result output from one crossbar array among the plurality of crossbar arrays (e.g., Crossbar array 0, Crossbar array 1, and Crossbar array 2) included in one crossbar array group (e.g., Sub PE 0), and may calculate a second output activation (e.g., Output data 1) by using an operation result output from another crossbar array among the plurality of crossbar arrays (e.g., Crossbar array 0, Crossbar array 1, and Crossbar array 2). Each of the operation results output from the plurality of crossbar arrays may correspond to a partial sum for calculating an output activation.

An example process in which the processing circuit 530 obtains an output activation included in an output feature map by accumulating and adding at least some of the calculation results output from the plurality of crossbar array groups (Sub PE 0, Sub PE 1, and Sub PE 2) in units of a preset number of cycles will be described in detail below in the discussion of FIG. 6.

FIG. 6 is a diagram for explaining a process of performing a neural network operation by a neural network device according to one or more embodiments.

In FIG. 6, an example is shown in which the neural network device described with reference to FIG. 5 is configured to perform a 3×3 convolution operation with respect to an input feature map having a size of 4×4 and including input activations X₀₀ to X₃₃. In one example, each of the 9 crossbar arrays (Xbar 0 to Xbar 8) used to perform the 3×3 convolution operation may include 128 column lines and 128 row lines, and the input activation may be data of 128 bits. However, this is only an example, and the present embodiment is not limited thereto.

Further, the input activations included in the input feature map may be sequentially input to the shift register circuit 510 on every cycle in a row direction. For example, input activations may be input to the shift register circuit 510 in the order of X₀₀, X₁₀, X₂₀, and X₃₀, and after X₃₀ is input, input activations may be input to the shift register circuit 510 in the order of X₀₁, X₁₁, X₂₁, and X₃₁. Also, even afterwards, input activations up to X₃₃ may be sequentially input to the shift register circuit 510 in the same manner.

Since the first register of the shift register circuit 510 is connected to the first crossbar array group (Xbar 0, Xbar 1, and Xbar 2), when X₀₀ is input to the first register of the shift register circuit 510 in the first cycle (cycle 0), X₀₀ may be transferred to the first crossbar array group (Xbar 0, Xbar 1, and Xbar 2). Accordingly, the first crossbar array group (Xbar 0, Xbar 1, and Xbar 2) may perform a MAC operation using X₀₀ as an operand. Afterwards, X₀₀ may be transferred to the next register in each cycle. In the fifth cycle (cycle 4), X₀₀ may be transferred to a register connected to the second crossbar array group (Xbar 3, Xbar 4, and Xbar 5). Accordingly, the second crossbar array group (Xbar 3, Xbar 4, and Xbar 5) may perform a MAC operation using X₀₀ as an operand.

Further, FIG. 6 illustrates an operation process corresponding to the period from a ninth cycle (cycle 8) to a twelfth cycle (cycle 11) after the first cycle (cycle 0) in which X₀₀ is input to the shift register circuit 510.

In the ninth cycle (cycle 8), the third crossbar array group (Xbar 6, Xbar 7, and Xbar 8) may perform a MAC operation on X₀₀, the second crossbar array group (Xbar 3, Xbar 4, and Xbar 5) may perform a MAC operation on X₀₁, and the first crossbar array group (Xbar 0, Xbar 1, and Xbar 2) may perform a MAC operation on X₀₂. The MAC operations on X₀₀, X₀₁, and X₀₂ may correspond to the MAC operation with respect to the first row of the input feature map.

In the 10th cycle (cycle 9), the third crossbar array group (Xbar 6, Xbar 7, and Xbar 8) may perform a MAC operation on X₁₀, the second crossbar array group (Xbar 3, Xbar 4, and Xbar 5) may perform a MAC operation on X₁₁, and the first crossbar array group (Xbar 0, Xbar 1, and Xbar 2) may perform a MAC operation on X₁₂. The MAC operations on X₁₀, X₁₁, and X₁₂ may correspond to the MAC operation with respect to the second row of the input feature map.

In the 11th cycle (cycle 10), the third crossbar array group (Xbar 6, Xbar 7, and Xbar 8) may perform a MAC operation on X₂₀, the second crossbar array group (Xbar 3, Xbar 4, and Xbar 5) may perform a MAC operation on X₂₁, and the first crossbar array group (Xbar 0, Xbar 1, and Xbar 2) may perform a MAC operation on X₂₂. The MAC operations on X₂₀, X₂₁, and X₂₂ may correspond to the MAC operation with respect to the third row of the input feature map.

In the 12th cycle (cycle 11), the third crossbar array group (Xbar 6, Xbar 7, and Xbar 8) may perform a MAC operation on X₃₀, the second crossbar array group (Xbar 3, Xbar 4, and Xbar 5) may perform a MAC operation on X₃₁, and the first crossbar array group (Xbar 0, Xbar 1, and Xbar 2) may perform a MAC operation on X₃₂. The MAC operations on X₃₀, X₃₁, and X₃₂ may correspond to the MAC operation with respect to the fourth row of the input feature map.

Further, in the ninth cycle (cycle 8), the operation results from Xbar 0, Xbar 3, and Xbar 6 may be selected from among the operation results output from the crossbar array groups; in the tenth cycle (cycle 9), the operation results from Xbar 1, Xbar 4, and Xbar 7 may be selected; and in the 11th cycle (cycle 10), the operation results from Xbar 2, Xbar 5, and Xbar 8 may be selected. The selected operation results may be converted into a 2's complement format, and then accumulated and added by a first accumulator (ACCUM 0), and accordingly, a final result (that is, a first output activation) of a 3×3 convolution operation corresponding to a first region 610 may be output.

Also, in the 10th cycle (cycle 9), the operation results from Xbar 0, Xbar 3, and Xbar 6 may be selected from among the operation results output from the crossbar array groups; in the 11th cycle (cycle 10), the operation results from Xbar 1, Xbar 4, and Xbar 7 may be selected; and in the 12th cycle (cycle 11), the operation results from Xbar 2, Xbar 5, and Xbar 8 may be selected. The selected operation results may be converted into a 2's complement format, and then accumulated and added by a second accumulator (ACCUM 1), and accordingly, a final result (that is, a second output activation) of a 3×3 convolution operation corresponding to a second region 620 may be output.
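
The selection schedule just described follows a simple diagonal pattern, which the index-only sketch below reproduces; it models which crossbar arrays feed an accumulator in each cycle and nothing else.

```python
# Schedule model of FIG. 6 (indices only, no analog behavior): each
# accumulator gathers one partial sum per crossbar array group per cycle,
# over three consecutive cycles.
def accumulation_schedule(start_cycle: int):
    """Yield (cycle, selected crossbar arrays) pairs for one accumulator."""
    groups = [(0, 1, 2), (3, 4, 5), (6, 7, 8)]   # Sub PE 0..2
    for offset in range(3):                      # three cycles per output
        cycle = start_cycle + offset
        selected = [group[offset] for group in groups]
        yield cycle, selected

print(list(accumulation_schedule(8)))   # ACCUM 0: first output activation
# [(8, [0, 3, 6]), (9, [1, 4, 7]), (10, [2, 5, 8])]
print(list(accumulation_schedule(9)))   # ACCUM 1: second output activation
# [(9, [0, 3, 6]), (10, [1, 4, 7]), (11, [2, 5, 8])]
```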

In this manner, two output activations may be output through two output lines over four cycles. Accordingly, compared to a neural network device of the related art in which two output activations are output during two cycles, the output bandwidth may be reduced. Also, a neural network device of the related art uses a bandwidth of 4×128 bits over two cycles, whereas the neural network device according to the present embodiment uses only a bandwidth of 128 bits per cycle; thus, the input bandwidth may be reduced by half. As the output bandwidth and the input bandwidth are reduced in this way, power consumption may be reduced.

In FIG. 6, an example in which the number of processing elements is 4, the number of sub-processing elements is 3, and the number of crossbar arrays is 9 is described, but this is only an example. The number of output lines may be adjusted by adjusting the number of crossbar arrays included in one sub-processing element, and neural network operations with respect to input feature maps or weight kernels of various sizes may be performed by adjusting the number of processing elements or the number of sub-processing elements.

In one example, a first processing element that outputs output activations through four output lines over six cycles may be implemented by adjusting the size or depth of the shift register circuit 510. In this case, the first processing element may be directly connected to the second processing element, which outputs output activations through two output lines over four cycles, as described with reference to FIG. 6. The output lines of the first processing element correspond to an output of one of the plurality of layers constituting a neural network, and may be directly connected to the input lines of the second processing element corresponding to the next layer. As described above, according to the present embodiment, connections between layers may be implemented without additional memory or additional digital logic. Also, since the operation of reading/writing the input/output to the memory is omitted, power consumption may be greatly reduced.

Further, the next layer, which includes input lines directly connected to the plurality of output lines of one layer, may include at least one of a convolution layer and a pooling layer. The operation process corresponding to the convolutional layer has already been described with reference to FIG. 6, and thus, hereinafter, an example operation process corresponding to the pooling layer will be described in greater detail with reference to FIG. 7.

FIG. 7 is a diagram explaining a process of performing pooling and activation function operations by a neural network device according to one or more embodiments.

In FIG. 7, a process of performing pooling and activation function operations on an output feature map 70 output by the neural network operation described with reference to FIG. 6 is illustrated as an example. A first row 710 of the output feature map 70 may correspond to output activations output through a first output line, and a second row 720 may correspond to output activations output through a second output line.

Output activations included in the first row 710 and output activations included in the second row 720 may be input to one of the plurality of pooling registers 730a to 730d. As an example, in the case of the neural network operation described above with reference to FIG. 6, two output activations are output through two output lines over four cycles. Accordingly, after x₀₀ and x₁₀ output in cycle 0 are respectively input to the pooling register 730a and the pooling register 730c, x₀₁ and x₁₁ output four cycles later (that is, in cycle 4) may be respectively input to the pooling register 730a and the pooling register 730c.

As x₀₁ and x₁₁ are newly input to the pooling register 730a and the pooling register 730c, respectively, x₀₀ and x₁₀ stored in the pooling register 730a and the pooling register 730c may be transferred to a pooling register 730b and a pooling register 730d, respectively. Accordingly, the pooling register 730a may store x₀₁, the pooling register 730b may store x₀₀, the pooling register 730c may store x₁₁, and the pooling register 730d may store x₁₀.

A pooling operator 740 may perform a 2×2 pooling operation after receiving the output activations from the plurality of pooling registers 730a to 730d. Accordingly, the result of the pooling operation on x₀₀, x₁₀, x₀₁, and x₁₁ may be output in cycle 5. The pooling operation may be max pooling, average pooling, L2-norm pooling, etc., but is not limited thereto. In an example, when the pooling operation corresponds to max pooling, the maximum value among x₀₀, x₁₀, x₀₁, and x₁₁ may be output from the pooling operator 740.
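
A minimal model of this pooling step, assuming max pooling and hypothetical register contents (the dictionary keys below simply mirror the register labels 730a to 730d):

```python
# The four pooling registers hold x01, x00, x11, x10 as described above;
# the values here are hypothetical output activations.
pooling_registers = {"730a": 0.3, "730b": 0.7, "730c": 0.1, "730d": 0.5}

def max_pool(registers: dict) -> float:
    """2x2 max pooling over the stored output activations."""
    return max(registers.values())

print(max_pool(pooling_registers))  # 0.7
```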

An activation function 750 may apply an activation function to the result of the pooling operation received from the pooling operator 740. Accordingly, in cycle 6, a final output to which the activation function is applied may be output. Afterwards, in cycle 8, new output activations are output through the output lines, and the process described above may be repeated. An overall timing diagram of the process in which the neural network device is configured to perform pooling and activation function operations is shown in table 760.

In this way, the neural network device according to the present embodiment may directly connect the output lines of a convolution layer to the pooling layer without an additional buffer. In FIG. 7, for convenience of explanation, a 2×2 pooling with a stride of 2 has been shown, but the present embodiment is not limited thereto. Pooling operations of various sizes may be performed according to the structure of a pooling layer.

FIG. 8 is a block diagram illustrating a configuration of an electronic system according to one or more embodiments.

In FIG. 8, the electronic system 80 may extract valid information by analyzing input data in real time based on a neural network, and may determine a situation or control the configuration of a device including the electronic system 80 based on the extracted information, noting that the electronic system and the discussed neural network are also representative of such a device. For example, the electronic system 80 may be applied to, or representative of, a robotic device, such as a drone or an advanced driver assistance system (ADAS), a smart TV, a smart phone, a medical device, a mobile device, an image display device, a measurement device, an IoT device, and various other types of electronic devices, as non-limiting examples.

The electronic system 80 may include a processor 810, a RAM 820, a neural network device 830, a memory 840, a sensor module 850, and a communication module 860. The electronic system 80 may further include an input/output module, a security module, and a power control device. Some of the hardware components of the electronic system 80 may be mounted on at least one semiconductor chip.

The processor 810 controls the overall operation of the electronic system 80. The processor 810 may include a single processor core (Single Core) or a plurality of processor cores (Multi-Core). The processor 810 may process or execute instructions and/or data stored in the memory 840. In one or more embodiments, the processor 810 may control functions of the neural network device 830 by executing instructions stored in the memory 840. The processor 810 may be implemented by a central processing unit (CPU), a graphics processing unit (GPU), an application processor (AP), etc.

The RAM 820 may temporarily store programs, data, or instructions. For example, instructions and/or data stored in the memory 840 may be temporarily stored in the RAM 820 according to the control or booting code of the processor 810. The RAM 820 may be implemented as a memory, such as dynamic RAM (DRAM), static RAM (SRAM), etc.

The neural network device 830 may perform an operation of a neural network based on received input data and generate an information signal based on the execution result. Neural networks may include convolutional neural networks (CNNs), recurrent neural networks (RNNs), deep belief networks, restricted Boltzmann machines, etc., but are not limited thereto. The neural network device 830 may be a hardware accelerator dedicated to the neural network or a device including the same, and may correspond to the neural network device described above with reference to FIGS. 4 to 7.

The neural network device 830 may control a plurality of crossbar arrays so that the plurality of crossbar arrays share and process the same input data by using the shift register circuit 510, and may select at least some of the operation results output from the plurality of crossbar arrays. Also, the neural network device 830 may acquire a final output by accumulating and adding the selected operation results in units of a preset number of cycles. Accordingly, input reuse is increased and the number of memory accesses is decreased when compared to typical hardware devices, and thus, the power consumption for driving the neural network device 830 may be reduced.

An information signal may include one of various types of recognition signals, such as a voice recognition signal, an object recognition signal, an image recognition signal, and a biometric information recognition signal. For example, the neural network device 830 may receive frame data included in a video stream as input data and generate, on the basis of the frame data, a recognition signal with respect to an object included in an image displayed by the frame data. However, examples are not limited thereto, and the neural network device 830 may receive various types of input data according to the type or function of an electronic device on which the electronic system 80 is mounted, or of which the electronic system 80 is representative, and may generate a recognition signal according to the input data.

The memory 840 is a storage for storing data and may store an operating system (OS), various instructions, programs, and various data. In an embodiment, the memory 840 may store intermediate results generated in the process of performing an operation of the neural network device 830.

The memory 840 may be DRAM, but is not limited thereto. The memory 840 may include at least one of volatile memory and nonvolatile memory. The nonvolatile memory includes ROM, PROM, EPROM, EEPROM, flash memory, PRAM, MRAM, RRAM, FRAM, etc. The volatile memory includes DRAM, SRAM, SDRAM, PRAM, MRAM, RRAM, FeRAM, etc. In an embodiment, the memory 840 may include at least one of HDD, SSD, CF, SD, Micro-SD, Mini-SD, xD, and Memory Stick.

The sensor module 850 may collect information around an electronic device on which the electronic system 80 is mounted. The sensor module 850 may sense or receive a signal (e.g., an image signal, a voice signal, a magnetic signal, a bio signal, a touch signal, etc.) from the outside of the electronic device and convert the sensed or received signal into data. To this end, the sensor module 850 may include at least one of various types of sensing devices, for example, a microphone, an imaging device, an image sensor, a light detection and ranging (LiDAR) sensor, an ultrasonic sensor, an infrared sensor, a bio sensor, and a touch sensor.

The sensor module 850 may provide the converted data as input data to the neural network device 830. For example, the sensor module 850 may include an image sensor, generate a video stream by photographing the external environment of the electronic device, and sequentially provide successive data frames of the video stream to the neural network device 830 as input data. However, the present embodiment is not limited thereto, and the sensor module 850 may provide various types of data to the neural network device 830.

The communication module 860 may include various wired or wireless interfaces capable of communicating with external devices. For example, the communication module 860 may include a local area network (LAN), a wireless local area network (WLAN), such as Wi-Fi, a wireless personal area network (WPAN), such as Bluetooth, a wireless universal serial bus (USB), ZigBee, near-field communication (NFC), radio-frequency identification (RFID), power-line communication (PLC), or a communication interface capable of connecting to a mobile cellular network, such as 3rd generation (3G), 4th generation (4G), long-term evolution (LTE), or 5th generation (5G).

FIG. 9 is a flowchart illustrating an operating method of a neural network device according to one or more embodiments.

In FIG. 9, a method of operating a neural network device includes operations processed in a time series in the neural network device illustrated in FIGS. 4 to 7. Accordingly, even though omitted below, the descriptions given with respect to FIGS. 4 to 7 may also be applied to the operating method of the neural network device of FIG. 9.

In operation 910, a neural network device may sequentially input the input activations included in an input feature map into the shift register circuit 510 according to a preset order. The shift register circuit 510 may include a plurality of registers that transfer stored data to the next register in every cycle and store new data received from the previous register.

In an example, the neural network device may sequentially input the input activations to the shift register circuit 510 in a row direction of the input feature map. The input activations input to the shift register circuit 510 may be sequentially shifted from the first register to the last register of the shift register circuit 510.

On the other hand, the neural network device may receive a 1-bit zero mark in every cycle and, when the value of the zero mark is 1, may control the plurality of crossbar array groups to omit a MAC operation with respect to the input activations corresponding to the zero mark. The zero mark may be input to the shift register circuit 510 together with row-sized input activations, or may be separately stored. However, the present embodiment is not limited thereto.

In operation 920, the neural network device may receive an input activation from at least one of the plurality of registers included in the shift register circuit 510 and perform a MAC operation on the received input activation and weights, by using a plurality of crossbar array groups. A plurality of crossbar arrays included in one crossbar array group among the plurality of crossbar array groups may share the same input activation. Accordingly, the input reuse efficiency may be significantly increased and a MUX for processing input data is not required, and thus, the hardware structure may be significantly simplified. Also, since a digital decoder and the control logic required to operate the crossbar array are shared among the plurality of crossbar arrays, the area of the hardware may be reduced when compared to typical hardware devices.

In this example, the number of crossbar arrays included in one crossbar array group corresponds to a width of a weight kernel, and the number of registers, among the plurality of registers, that transfer an input activation to the plurality of crossbar array groups may correspond to a height of the weight kernel. The circuit structure of the neural network device may thus be adjusted to perform a neural network operation corresponding to the size of the weight kernel.
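A sketch of this sizing, assuming row-major streaming with stride 1 (the constants are hypothetical): one crossbar array per kernel column, and one register tap per kernel row, with the taps spaced one feature-map row apart in the shift register chain because that is the distance between vertically adjacent activations in a row-major stream.

    KERNEL_W, KERNEL_H = 3, 3   # hypothetical 3x3 weight kernel
    IFM_WIDTH = 8               # hypothetical input feature map width

    ARRAYS_PER_GROUP = KERNEL_W  # one crossbar array per kernel column

    def feeding_register_indices(ifm_width=IFM_WIDTH, kernel_h=KERNEL_H):
        """Registers that feed the crossbar array groups: one tap per kernel row.
        Same-column activations from adjacent rows sit ifm_width positions apart."""
        return [r * ifm_width for r in range(kernel_h)]

    print(ARRAYS_PER_GROUP, feeding_register_indices())  # 3 [0, 8, 16]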

In operation 930, the neural network device may obtain an output activation included in an output feature map by accumulating and adding at least some of the operation results output from the plurality of crossbar array groups in units of a preset number of cycles. For example, the neural network device may select at least some of the operation results output from the plurality of crossbar array groups, convert the selected operation results into a 2's complement format, and accumulate and add the converted operation results to obtain an output activation.
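For illustration, a minimal sketch of the selection, 2's complement conversion, and accumulation (the bit width is hypothetical):

    BITS = 8  # hypothetical width of each raw operation result

    def to_signed(raw, bits=BITS):
        """Reinterpret a raw unsigned readout as a signed 2's complement value."""
        return raw - (1 << bits) if raw >= (1 << (bits - 1)) else raw

    def output_activation(selected_results, bits=BITS):
        """Accumulate and add the selected results gathered over the preset
        number of cycles into one output activation."""
        return sum(to_signed(r, bits) for r in selected_results)

    # Example: 0xFE is -2 in 8-bit 2's complement, so the sum is -2 + 5 + 1 = 4.
    print(output_activation([0xFE, 0x05, 0x01]))  # 4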

The neural network device may calculate a first output activation by using an operation result output from one crossbar array among the plurality of crossbar arrays, and calculate a second output activation by using an operation result output from another crossbar array among the plurality of crossbar arrays.
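A brief self-contained sketch of that mapping (the wiring and indices are purely illustrative): partial sums from one crossbar array feed one output activation, and partial sums from another array feed a different one.

    def crossbar_mac(x, W):
        """Column j accumulates sum_i x[i] * W[i][j]."""
        return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

    x = [1, 2]                                     # shared input activations
    arrays = [[[1, 0], [0, 1]], [[2, 2], [2, 2]]]  # two crossbar arrays in a group
    results = [crossbar_mac(x, W) for W in arrays]
    first_output_activation = sum(results[0])      # from one crossbar array
    second_output_activation = sum(results[1])     # from another crossbar array
    print(first_output_activation, second_output_activation)  # 3 12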

The processing block, processing block 4, control circuit, processing elements, sub-processing elements, crossbar arrays, pooling registers, pooling operator, activation function, shift register circuit 510, control circuit 520, processing circuit 530, electronic system 80, processor 810, RAM 820, neural network device 830, memory 840, sensor module 850, and communication module 860 in FIGS. 1-9 that perform the operations described in this application are implemented by hardware components configured to perform the operations described in this application that are performed by the hardware components. Examples of hardware components that may be used to perform the operations described in this application where appropriate include controllers, sensors, generators, drivers, memories, comparators, arithmetic logic units, adders, subtractors, multipliers, dividers, integrators, and any other electronic components configured to perform the operations described in this application. In other examples, one or more of the hardware components that perform the operations described in this application are implemented by computing hardware, for example, by one or more processors or computers. A processor or computer may be implemented by one or more processing elements, such as an array of logic gates, a controller and an arithmetic logic unit, a digital signal processor, a microcomputer, a programmable logic controller, a field-programmable gate array, a programmable logic array, a microprocessor, or any other device or combination of devices that is configured to respond to and execute instructions in a defined manner to achieve a desired result. In one example, a processor or computer includes, or is connected to, one or more memories storing instructions or software that are executed by the processor or computer. Hardware components implemented by a processor or computer may execute instructions or software, such as an operating system (OS) and one or more software applications that run on the OS, to perform the operations described in this application. The hardware components may also access, manipulate, process, create, and store data in response to execution of the instructions or software. For simplicity, the singular term “processor” or “computer” may be used in the description of the examples described in this application, but in other examples multiple processors or computers may be used, or a processor or computer may include multiple processing elements, or multiple types of processing elements, or both. For example, a single hardware component or two or more hardware components may be implemented by a single processor, or two or more processors, or a processor and a controller. One or more hardware components may be implemented by one or more processors, or a processor and a controller, and one or more other hardware components may be implemented by one or more other processors, or another processor and another controller. One or more processors, or a processor and a controller, may implement a single hardware component, or two or more hardware components. A hardware component may have any one or more of different processing configurations, examples of which include a single processor, independent processors, parallel processors, single-instruction single-data (SISD) multiprocessing, single-instruction multiple-data (SIMD) multiprocessing, multiple-instruction single-data (MISD) multiprocessing, and multiple-instruction multiple-data (MIMD) multiprocessing.

The methods illustrated in FIGS. 1-9 that perform the operations described in this application are performed by computing hardware, for example, by one or more processors or computers, implemented as described above executing instructions or software to perform the operations described in this application that are performed by the methods. For example, a single operation or two or more operations may be performed by a single processor, or two or more processors, or a processor and a controller. One or more operations may be performed by one or more processors, or a processor and a controller, and one or more other operations may be performed by one or more other processors, or another processor and another controller. One or more processors, or a processor and a controller, may perform a single operation, or two or more operations.

Instructions or software to control computing hardware, for example, one or more processors or computers, to implement the hardware components and perform the methods as described above may be written as computer programs, code segments, instructions or any combination thereof, for individually or collectively instructing or configuring the one or more processors or computers to operate as a machine or special-purpose computer to perform the operations that are performed by the hardware components and the methods as described above. In one example, the instructions or software include machine code that is directly executed by the one or more processors or computers, such as machine code produced by a compiler. In another example, the instructions or software include higher-level code that is executed by the one or more processors or computers using an interpreter. The instructions or software may be written using any programming language based on the block diagrams and the flow charts illustrated in the drawings and the corresponding descriptions in the specification, which disclose algorithms for performing the operations that are performed by the hardware components and the methods as described above.

The instructions or software to control computing hardware, for example, one or more processors or computers, to implement the hardware components and perform the methods as described above, and any associated data, data files, and data structures, may be recorded, stored, or fixed in or on one or more non-transitory computer-readable storage media. Examples of a non-transitory computer-readable storage medium include read-only memory (ROM), random-access memory (RAM), flash memory, CD-ROMs, CD-Rs, CD+Rs, CD-RWs, CD+RWs, DVD-ROMs, DVD-Rs, DVD+Rs, DVD-RWs, DVD+RWs, DVD-RAMs, BD-ROMs, BD-Rs, BD-R LTHs, BD-REs, magnetic tapes, floppy disks, magneto-optical data storage devices, optical data storage devices, hard disks, solid-state disks, and any other device that is configured to store the instructions or software and any associated data, data files, and data structures in a non-transitory manner and provide the instructions or software and any associated data, data files, and data structures to one or more processors or computers so that the one or more processors or computers can execute the instructions. In one example, the instructions or software and any associated data, data files, and data structures are distributed over network-coupled computer systems so that the instructions and software and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by the one or more processors or computers.

While this disclosure includes specific examples, it will be apparent after an understanding of the disclosure of this application that various changes in form and details may be made in these examples without departing from the spirit and scope of the claims and their equivalents. The examples described herein are to be considered in a descriptive sense only, and not for purposes of limitation. Descriptions of features or aspects in each example are to be considered as being applicable to similar features or aspects in other examples. Suitable results may be achieved if the described techniques are performed in a different order, and/or if components in a described system, architecture, device, or circuit are combined in a different manner, and/or replaced or supplemented by other components or their equivalents. Therefore, the scope of the disclosure is defined not by the detailed description, but by the claims and their equivalents, and all variations within the scope of the claims and their equivalents are to be construed as being included in the disclosure.

What is claimed is:
 1. A neural network device comprising: a shift register circuit comprising registers configured to, in each cycle of plural cycles, transfer stored data to a next register and store new data received from a previous register; a control circuit configured to sequentially input data of input activations included in an input feature map into the shift register circuit in a preset order; and a processing circuit, comprising crossbar array groups that receive input activations from at least one of the registers and perform a multiply-accumulate (MAC) operation with respect to the received input activation and weights, configured to accumulate and add at least some operation results output from the crossbar array groups in units of a preset number of cycles to obtain an output activation in an output feature map.
 2. The neural network device of claim 1, wherein the control circuit is further configured to receive a 1-bit zero mark on each of the plural cycles, and in response to the value of the zero mark being 1, to control the crossbar array groups to omit a MAC operation with respect to input activations corresponding to the zero mark.
 3. The neural network device of claim 1, wherein crossbar arrays included in one crossbar array group of the crossbar array groups share a same input activation.
 4. The neural network device of claim 3, wherein each of the crossbar arrays comprises: a plurality of row lines; a plurality of column lines intersecting the plurality of row lines; and memory cells respectively disposed at the intersections of the plurality of row lines and the plurality of column lines, and configured to store the weights included in a weight kernel.
 5. The neural network device of claim 3, wherein the processing circuit is further configured to obtain a first output activation using an operation result output from one of the crossbar arrays, and obtain a second output activation using an operation result output from another of the crossbar arrays.
 6. The neural network device of claim 3, wherein a number of the crossbar arrays included in the one crossbar array group corresponds to a width of a weight kernel.
 7. The neural network device of claim 1, wherein a number of registers, among the registers, that transfer an input activation to the crossbar array groups corresponds to a height of a weight kernel.
 8. The neural network device of claim 1, wherein the processing circuit is further configured to select at least some of the operation results output from the crossbar array groups, convert the selected operation results into a 2's complement format, and accumulate and add the converted operation results to obtain the output activation.
 9. The neural network device of claim 1, wherein the processing circuit comprises an output line through which the output activation is output, and the output line corresponds to an output of one of a plurality of layers constituting a neural network, and is directly connected to an input line of a next layer.
 10. The neural network device of claim 9, wherein the next layer comprises either one or both of a convolution layer and a pooling layer.
 11. A method of a neural network device, the method comprising: sequentially inputting input activations included in an input feature map into a shift register circuit in a preset order; receiving, by a corresponding crossbar array group of crossbar array groups, an input activation of the input activations from at least one of a plurality of registers of the shift register circuit and performing a multiply-accumulate (MAC) operation on the received input activation and weights; and obtaining an output activation included in an output feature map by accumulating and adding at least some of the operation results output from the crossbar array groups in units of a preset number of cycles.
 12. The method of claim 11, further comprising: receiving a 1-bit zero mark on each cycle of the sequentially inputting of the input activations; and in response to the value of the zero mark being 1, controlling the crossbar array groups to omit the MAC operation with respect to input activations corresponding to the zero mark.
 13. The method of claim 11, wherein crossbar arrays included in one crossbar array group of the crossbar array groups share a same input activation.
 14. The method of claim 13, wherein each of the crossbar arrays comprises: a plurality of row lines; a plurality of column lines intersecting the plurality of row lines; and memory cells respectively disposed at the intersections of the plurality of row lines and the plurality of column lines, and configured to store the weights of a weight kernel.
 15. The method of claim 13, further comprising: obtaining a first output activation using an operation result output from one of the crossbar arrays; and obtaining a second output activation using an operation result output from another crossbar array of the crossbar arrays.
 16. The method of claim 13, wherein a number of the crossbar arrays included in the one crossbar array group corresponds to a width of a weight kernel.
 17. The method of claim 11, wherein a number of registers, among the plurality of registers, that transfer an input activation to the crossbar array groups corresponds to a height of a weight kernel.
 18. The method of claim 11, wherein the obtaining the output activation comprises: selecting at least some operation results output from the crossbar array groups; converting the selected operation results into a 2's complement format; and accumulating and adding the converted operation results.
 19. The method of claim 11, further comprising outputting the output activation via an output line, wherein the output line corresponds to an output of one of a plurality of layers constituting a neural network, and is directly connected to an input line of a next layer.
 20. The method of claim 19, wherein the next layer comprises either one or both of a convolutional layer and a pooling layer. 